import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook+pdf+jupyterlab+browser"
The dataset contains up to 27000 bigbasket products. Bigbasket is the largest retail store in India. This notebook performs an analysis explore insights in the dataset.
data = pd.read_csv('BigBasket Products.csv')
data.head()
| index | product | category | sub_category | brand | sale_price | market_price | type | rating | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220.0 | Hair Oil & Serum | 4.1 | This Product contains Garlic Oil that is known... |
| 1 | 2 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180.0 | Water & Fridge Bottles | 2.3 | Each product is microwave safe (without lid), ... |
| 2 | 3 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250.0 | Lamp & Lamp Oil | 3.4 | A perfect gift for all occasions, be it your m... |
| 3 | 4 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176.0 | Laundry, Storage Baskets | 3.7 | Multipurpose container with an attractive desi... |
| 4 | 5 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162.0 | Bathing Bars & Soaps | 4.4 | Nivea Creme Soft Soap gives your skin the best... |
Little info about the dataset is shown in the output above and we can see columns like the category, the product and even a sub category which further helps to classify the produc.We can also see the market price which is the price the product is being sold at big product and the market price which is the price the product is being sold outside of bigproduct. The ratings and the description both tell us how the customers feel about each product.
data = data.set_index('index')
The first thing we do is check for missing values in our dataset. We can see below that not too much data is actually missing, there is one product each missing from the brand and product columns and the rating column is missing 8626 values, description also with 115 descriptions.
data.isnull().sum()
product 1 category 0 sub_category 0 brand 1 sale_price 0 market_price 0 type 0 rating 8626 description 115 dtype: int64
Time has come to decide how to treat these missing values, first is to check for the index of the rows that contain the product and brand missing values and then drop them by index.
For the brand, the missing value falls at the row with index of 9766 while the missing value for the product falls at 14 the row with index of 14364.
print(data[data['brand'].isnull()])
product category sub_category \
index
9766 Food Package - Medium Cleaning & Household Disposables, Garbage Bag
brand sale_price market_price type rating \
index
9766 NaN 50.0 50.0 Aluminium Foil, Clingwrap NaN
description
index
9766 NaN
print(data[data['product'].isnull()])
product category sub_category brand sale_price \
index
14364 NaN Beverages Coffee Cothas Coffee 200.0
market_price type rating \
index
14364 240.0 Ground Coffee 4.2
description
index
14364 Cothas Specialty Blend Coffee and Chicory incl...
To resolve the issue for the missing rating, I'm just gonna drop the rows where the ratings are missing too, dropping the features column means removing an important feature in my dataset and I can't seem to think of any values to impute in because the ratings come from what the customer feels when they use the product< so the best option I think is to just drop those rows.
missing_rating = data[data['rating'].isnull()]
missing_rating.head()
| product | category | sub_category | brand | sale_price | market_price | type | rating | description | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 56 | Soothing Cucumber Facial Scrub With Apricot Seeds | Beauty & Hygiene | Skin Care | TJORI | 299.4 | 499.0 | Face Care | NaN | The soothing feel of cucumber meets the gentle... |
| 60 | Corporate Planner Diary With Premium PU Leathe... | Cleaning & Household | Stationery | Prozo Plus | 399.0 | 399.0 | Notebooks, Files, Folders | NaN | A5 Size (210x150mm) \n192 Pages Premium Natura... |
| 66 | Ayurvedic Anti-Tan Face Pack | Beauty & Hygiene | Skin Care | TJORI | 269.4 | 449.0 | Face Care | NaN | A nourishing face pack that removes tan and br... |
| 69 | Organic Carom Seeds/Ajwain/Om Kalu | Foodgrains, Oil & Masala | Masalas & Spices | Earthon | 72.0 | 72.0 | Whole Spices | NaN | Earthon's Ajwain is Best quality, organically ... |
| 70 | Padded Harness - 3/4 inch, Grey Colour | Kitchen, Garden & Pets | Pet Food & Accessories | Glenand | 840.0 | 840.0 | Pet Collars & Leashes | NaN | These are soft padded harness for your active ... |
missing_index = missing_rating.index
print(missing_index)
Int64Index([ 56, 60, 66, 69, 70, 75, 79, 82, 84,
85,
...
27492, 27496, 27501, 27502, 27506, 27510, 27512, 27515, 27531,
27532],
dtype='int64', name='index', length=8626)
#number =[missing_index - 1 for number in missing_index]
data = data.drop(missing_index)
data.isnull().sum()
product 1 category 0 sub_category 0 brand 0 sale_price 0 market_price 0 type 0 rating 0 description 88 dtype: int64
data = data.drop([14364])
data.isnull().sum()
product 0 category 0 sub_category 0 brand 0 sale_price 0 market_price 0 type 0 rating 0 description 88 dtype: int64
So, I'm just going to leave the desriptions column that way it might not be very useful until much later.
We begin the EDA by checking for which categories are the most popular. The histogram plot shows the count of each unique categories which are 9 in total.
data['category'].nunique()
9
data.head()
| product | category | sub_category | brand | sale_price | market_price | type | rating | description | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 1 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220.0 | Hair Oil & Serum | 4.1 | This Product contains Garlic Oil that is known... |
| 2 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180.0 | Water & Fridge Bottles | 2.3 | Each product is microwave safe (without lid), ... |
| 3 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250.0 | Lamp & Lamp Oil | 3.4 | A perfect gift for all occasions, be it your m... |
| 4 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176.0 | Laundry, Storage Baskets | 3.7 | Multipurpose container with an attractive desi... |
| 5 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162.0 | Bathing Bars & Soaps | 4.4 | Nivea Creme Soft Soap gives your skin the best... |
fig1 = px.histogram(data_frame=data, template='plotly_dark', x='category', title='Most popular categories')
fig1.show()
The cell below shows all the categories
data['category'].unique()
array(['Beauty & Hygiene', 'Kitchen, Garden & Pets',
'Cleaning & Household', 'Gourmet & World Food',
'Foodgrains, Oil & Masala', 'Snacks & Branded Foods', 'Beverages',
'Bakery, Cakes & Dairy', 'Baby Care'], dtype=object)
The next visualization shows plots the average price for each category.
categorr = data.groupby(['category'])['market_price'].mean().reset_index()
categorr.head()
| category | market_price | |
|---|---|---|
| 0 | Baby Care | 548.118952 |
| 1 | Bakery, Cakes & Dairy | 155.518797 |
| 2 | Beauty & Hygiene | 377.969513 |
| 3 | Beverages | 257.758294 |
| 4 | Cleaning & Household | 234.195476 |
fig2 = px.bar(data_frame=categorr, x='category', y='market_price', title='Average market price of each category', template='plotly_dark', color='market_price')
fig2.show()
Here, we can see that the prices for Kitchen, Garden and Pets category are the highest and the Snacks & branded foods have the lowest average prices of all the categories, this is probably true for the rating too, but we can't know except we check(that is going to be checked later)
categorr_sale = data.groupby(['category'])['sale_price'].mean().reset_index()
categorr_sale.head()
| category | sale_price | |
|---|---|---|
| 0 | Baby Care | 482.509153 |
| 1 | Bakery, Cakes & Dairy | 142.417955 |
| 2 | Beauty & Hygiene | 313.029733 |
| 3 | Beverages | 229.664613 |
| 4 | Cleaning & Household | 201.621419 |
Here, we are visualizing the average sale price of each catgory.
fig2 = px.bar(data_frame=categorr_sale, x='category', y='sale_price', title='Average sale price of each category', template='plotly_dark', color='sale_price')
fig2.show()
Hovering on top of each bar shows you the average sale price that has been plotted. we can see its not the Kitchen, Garden and pets catgory as it was before, the baby care category comes at the top this time.
Next, we're comparing the market and sale prices for each category and see how much they differ from each other.
categorr_sale = categorr_sale.join(categorr['market_price'])
categorr_sale.head()
| category | sale_price | market_price | |
|---|---|---|---|
| 0 | Baby Care | 482.509153 | 548.118952 |
| 1 | Bakery, Cakes & Dairy | 142.417955 | 155.518797 |
| 2 | Beauty & Hygiene | 313.029733 | 377.969513 |
| 3 | Beverages | 229.664613 | 257.758294 |
| 4 | Cleaning & Household | 201.621419 | 234.195476 |
categorr_sale = pd.melt(categorr_sale, id_vars=['category'], var_name='prices')
fig3 = px.bar(data_frame=categorr_sale, x='category', color='prices', y='value', template='plotly_dark', barmode='group')
fig3.show()
There isn't much difference between the prices for most of the categories now except for the Kitchen, Garden and pets category, there is an approximately 26% difference in the ale price and market price with the market price being the higher price. If there was enough data, we'd have probably been able to check whether the store was running at a loss or not.
top_data = data.groupby(['category'])['rating'].mean().reset_index()
top_data
| category | rating | |
|---|---|---|
| 0 | Baby Care | 4.023790 |
| 1 | Bakery, Cakes & Dairy | 3.911128 |
| 2 | Beauty & Hygiene | 3.930655 |
| 3 | Beverages | 4.084676 |
| 4 | Cleaning & Household | 3.956667 |
| 5 | Foodgrains, Oil & Masala | 4.062198 |
| 6 | Gourmet & World Food | 3.984410 |
| 7 | Kitchen, Garden & Pets | 3.734715 |
| 8 | Snacks & Branded Foods | 3.983313 |
fig_4 = px.bar(data_frame=top_data, x='category', y='rating', color='rating', title='Average rating of all categories', template='plotly_dark')
fig_4.show()
There isn't really much difference as we can see in the plot, the averge rating for the categories is approximately 4. The highest being beverages at 4.08 and Kitchen, Garden & Pets as the lowest at 3.73.
I wonder how Kitchen, Garden & Pets is the lowest though, considering that it has the highest market_price to sale_price ratio.
top_10 = data.sort_values(by=['sale_price'], ascending=False).head(10)
top_10
| product | category | sub_category | brand | sale_price | market_price | type | rating | description | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 1767 | Good Girl Eau De Parfum For Women | Beauty & Hygiene | Fragrances & Deos | Carolina Herrera | 6660.0 | 7400.0 | Eau De Parfum | 5.0 | Well behaved girls rarely make history...GOOD ... |
| 12193 | Man In Black Eau De Parfum | Beauty & Hygiene | Fragrances & Deos | Bvlgari | 6100.0 | 6100.0 | Eau De Parfum | 4.0 | It opens with accords of natural rum and lumin... |
| 21638 | Casserole Set - Die-Cast, Induction Base, Granite | Kitchen, Garden & Pets | Cookware & Non Stick | Wonderchef | 6000.0 | 6000.0 | Cookware Sets | 5.0 | A reflection of classic and elegance, this mul... |
| 13963 | Black Edition Eau De Parfum For Men | Beauty & Hygiene | Fragrances & Deos | Bentley | 5940.0 | 6600.0 | Eau De Parfum | 1.0 | A truly exquisite scent in a beautifully desig... |
| 17768 | Eau De Toilette For Men | Beauty & Hygiene | Fragrances & Deos | Bentley | 4905.0 | 5450.0 | Eau De Toilette | 5.0 | Gentlemen, make way for Bentley for Men, the f... |
| 13560 | Dog Food - Super Premium, Focus, Puppy, Limite... | Kitchen, Garden & Pets | Pet Food & Accessories | Drools | 4675.0 | 5500.0 | Pet Meals & Treats | 4.6 | Drools Optimum Performance Adult Pet food is a... |
| 3894 | Naturally Pure Olive Oil | Gourmet & World Food | Oils & Vinegar | Fragata | 4600.0 | 4600.0 | Pure, Pomace Olive Oil | 1.0 | Fragata, backed by an experience of over a cen... |
| 27539 | Quista Pro Advanced Whey Protein Formula forti... | Beauty & Hygiene | Health & Medicine | Himalaya | 4500.0 | 4500.0 | Supplements & Proteins | 4.0 | Quista Pro is a whey protein blend that helps ... |
| 2366 | Pro Expert Nutrition Large Breed Puppy (3-18 M... | Kitchen, Garden & Pets | Pet Food & Accessories | Pedigree | 4480.0 | 5600.0 | Pet Meals & Treats | 5.0 | Different dog breeds have different nutritiona... |
| 6452 | Extra Virgin Olive Oil | Gourmet & World Food | Oils & Vinegar | Pietro Coricelli | 4400.0 | 4400.0 | Extra Virgin Olive Oil | 4.7 | Since 1939, the family-owned and managed Pietr... |
The next bar plot will rank the top 10 most expensive products at bigbasket
fig_5 = px.bar(data_frame=top_10, x='product', y='sale_price', color='category', template='plotly_dark')
fig_5.show()
It is seen that the beauty and hygiene category has the top 5 most expensive products with perfumes ranking in the top 4 and a protein supplement product at 5th place, next we have 3 products from Kitchen, Garden AND Pets and the last 2 come from Gourmet and world food.
bottom_10 = data.sort_values(by=['sale_price']).head(10)
bottom_10
| product | category | sub_category | brand | sale_price | market_price | type | rating | description | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 21313 | Serum | Beauty & Hygiene | Hair Care | Livon | 3.0 | 3.0 | Hair Oil & Serum | 2.5 | Instantly Softens and Smoothens Dry, Rough, Ta... |
| 27491 | 50-50 Timepass Salted Biscuits | Snacks & Branded Foods | Biscuits & Cookies | Britannia | 5.0 | 5.0 | Salted Biscuits | 4.2 | Britannia 50-50 Timepass Salted Biscuits are a... |
| 17944 | Fulltoss Tangy Tomato | Gourmet & World Food | Snacks, Dry Fruits, Nuts | Parle | 5.0 | 5.0 | Nachos & Chips | 4.2 | A snacking sensation which brings a little fla... |
| 3446 | Marie Light Biscuits - Active | Snacks & Branded Foods | Ready To Cook & Eat | Sunfeast | 5.0 | 5.0 | Breakfast & Snack Mixes | 4.5 | EAN Code: 8901725114848 Country of origin: In... |
| 22179 | Tiger Elaichi Cream Biscuits | Snacks & Branded Foods | Biscuits & Cookies | Britannia | 5.0 | 5.0 | Glucose & Milk Biscuits | 4.2 | Britannia Tiger Elaichi Cream Biscuits are a t... |
| 26585 | Polo - The Mint With The Hole | Snacks & Branded Foods | Chocolates & Candies | Nestle | 5.0 | 5.0 | Toffee, Candy & Lollypop | 4.4 | Nestle's Polo was launched in India in 1993. P... |
| 22656 | Chewing Gum - Peppermint | Snacks & Branded Foods | Chocolates & Candies | Doublemint | 5.0 | 5.0 | Mints & Chewing Gum | 4.2 | Opportunities to connect with someone new happ... |
| 24672 | Fulltoss Thai Sriracha | Gourmet & World Food | Snacks, Dry Fruits, Nuts | Parle | 5.0 | 5.0 | Nachos & Chips | 4.1 | A snacking sensation which brings a little fla... |
| 6015 | Good Day Butter Cookies | Snacks & Branded Foods | Biscuits & Cookies | Britannia | 5.0 | 5.0 | Cookies | 4.1 | Britannia Good Day Butter Cookies are deliciou... |
| 22073 | Tiger Chocolate Cream Biscuits | Snacks & Branded Foods | Biscuits & Cookies | Britannia | 5.0 | 5.0 | Cream Biscuits & Wafers | 4.2 | Britannia Tiger Chocolate Cream Biscuits will ... |
Now with the least expensive products
fig_6 = px.bar(data_frame=bottom_10, x='product', y='sale_price', color='category', template='plotly_dark')
fig_6.show()
A product from beauty and hygiene seems like an outlier here because the rest come from snacks and world food categories.
Now, for the final part let's create a scatter plot of the market_price and sale_price and check out what relationship exist between them.
fig_7 = px.scatter(data_frame=data, x='sale_price', y='market_price', title='Scatter plot of market and sale price', template='plotly_dark', color='rating')
fig_7.show()
As expected, there is an almost perfect positive correlation between them. an increase in the market_price is accompanied by an obvious increase in the sale_price
With this, we come to the end of the EDA. Suggest visualizations you think should be added, thank you.